August 28, 2016

Higgs Event

Exploratory Data Analysis (EDA)

Missingness

  • A special kind of MNAR (missing not at random).

  • Up to 2/3 of the observations and 11 of the columns contain missing values

  • Missingness is controlled by jet_num and the predicted Higgs mass (mass_MMC)
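A minimal sketch of how the dependence on jet_num could be checked, assuming missing values are stored as the sentinel -999.0 and that the columns are named PRI_jet_num, DER_mass_MMC and PRI_jet_leading_pt. The helper name and the toy rows are made up for illustration and are not real challenge data.

```python
from collections import defaultdict

MISSING = -999.0  # assumed sentinel for undefined values


def missing_rate_by_group(rows, group_key, columns):
    """Return {group: {column: fraction of rows equal to MISSING}}."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for row in rows:
        group = row[group_key]
        totals[group] += 1
        for col in columns:
            if row[col] == MISSING:
                counts[group][col] += 1
    return {
        group: {col: counts[group][col] / totals[group] for col in columns}
        for group in totals
    }


# Toy rows, invented for illustration only.
rows = [
    {"PRI_jet_num": 0, "DER_mass_MMC": -999.0, "PRI_jet_leading_pt": -999.0},
    {"PRI_jet_num": 0, "DER_mass_MMC": 120.5, "PRI_jet_leading_pt": -999.0},
    {"PRI_jet_num": 1, "DER_mass_MMC": 98.2, "PRI_jet_leading_pt": 45.1},
    {"PRI_jet_num": 2, "DER_mass_MMC": 133.0, "PRI_jet_leading_pt": 80.3},
]

rates = missing_rate_by_group(
    rows, "PRI_jet_num", ["DER_mass_MMC", "PRI_jet_leading_pt"]
)
for jet_num in sorted(rates):
    print(jet_num, rates[jet_num])
```

If the missingness is driven by jet_num, the per-group rates for the jet-related columns should be close to either 0 or 1 rather than spread in between.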

The Hierarchical Clustering Pattern

Missingness (Continued)

  • A jet is the shotgun-like shower of decay products from quarks/gluons, sprayed in a particular direction.

  • The number of jets determines the type of physics interaction, just as SVM, GLM, and NN are DIFFERENT models under the same umbrella term MACHINE LEARNING

Higgs Mass Missingness Predicts Signal

  • The predicted Higgs mass is missing when the scattering topology is off the chart and would yield an infeasible Higgs mass; hence the missingness.

The Effect on the Missingness Imputation

  • MAR imputation methodology is useless here, or even harmful to the prediction

  • We would like to impute the Higgs mass and the momentum-related variables to zero, AWAY from the populated range of each variable.

  • Extrapolate the other missing values, viewing jet_num = 0 -> 1 -> 2 -> 3 as the degenerate cases.

  • Tree learning handles categorical and continuous variables well, separating out the imputed missing values naturally. This becomes our top candidate!
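The constant imputation described above can be sketched as follows, assuming missing values are stored as the sentinel -999.0. The function name, the column names and the toy rows are illustrative assumptions, not the code used for these slides.

```python
MISSING = -999.0  # assumed sentinel for undefined values


def impute_constant(rows, columns, fill=0.0):
    """Return copies of rows with MISSING replaced by `fill` in `columns`."""
    imputed = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        for col in columns:
            if new_row[col] == MISSING:
                new_row[col] = fill
        imputed.append(new_row)
    return imputed


# Toy rows, invented for illustration only.
rows = [
    {"DER_mass_MMC": -999.0, "PRI_jet_leading_pt": -999.0},
    {"DER_mass_MMC": 125.3, "PRI_jet_leading_pt": 52.0},
]

clean = impute_constant(rows, ["DER_mass_MMC", "PRI_jet_leading_pt"], fill=0.0)
print(clean)
```

Because the fill value lies outside the range the observed values occupy, a single tree split can isolate the imputed rows, which is the property the last bullet relies on.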

The variables are often highly correlated

Logistic Regression

Why Logistic Regression

  • Simple, Fast & Interpretable

  • A working ML pipeline

  • A Performance Baseline

Modeling & Result

Label ~ Original dataset - EventId - Weight

  • 18 significant variables

  • Training set AUC: 0.816

  • Test set AMS: 2.02
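For reference, a minimal sketch of the AMS (approximate median significance) score, assuming the usual challenge definition with a regularization term of 10. Here s and b stand for the weighted sums of correctly selected signal events and of background events selected as signal; the function and parameter names are illustrative.

```python
import math


def ams(s, b, b_reg=10.0):
    """Approximate median significance.

    s: weighted sum of true signal events selected as signal
    b: weighted sum of background events selected as signal
    b_reg: regularization term (assumed to be 10 here)
    """
    if s < 0 or b < 0:
        raise ValueError("s and b must be non-negative")
    return math.sqrt(
        2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s)
    )


print(ams(10.0, 90.0))
```

For s much smaller than b, the score behaves roughly like s / sqrt(b + b_reg), so it rewards selections that keep signal while cutting background.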

Feature Importance - Logistic

Multicollinearity - Logistic

RandomForest

The optimal RF (mtry = 3) on all the variables leads to a semi-overfit result

Feature Importance - RandomForest

In-Sample Predictions After Learning

  • We spent a great deal of effort trying to understand the reason.

Our Methodology

  • Split the 30 variables into two groups:
    • Variables with missingness
    • Complete variables
  • The same AUC = 1 pattern occurs with the 19 variables that have NO missingness, so we know it has nothing to do with the imputation methodology.

  • After the Higgs mass (which has missingness) is removed, the top two important variables are DER_mass_transverse_met_lep and DER_mass_vis

  • Running an RF on just these two variables still produces AUC = 1!

The Peculiar Behavior Prompts Us to Investigate the Variable Prob. Densities

Lepton-Hadronic Tau Phi-Phi angle Prob Density Plot

Lepton-Tau Eta-Eta Prob Density Plot

Explanation of the Failure of the Two-Variable RF

The Test, Train, and Total Prob Densities Vary Hugely

The Prob Densities Vary Hugely (Open Label Study!!!)

Summary

  • RF regression fits the pdfs by piecewise constant functions

  • The noise between the test set and the train set is huge, causing the RF's class estimates to be off.

  • It is quite possible the learning problem is too easy in sample, but RF cannot effectively anticipate the difference between the test and train sets.

XGBoost

Dealing with missing data

  • XGBoost is able to treat missing values properly

  • Less sensitive to noise

  • eXtremely FAST

  • Low memory use

  • Automatic parallel processing

Result

  • 2-fold CV for best parameter
    • metric = AUC + ams@.34
    • max_depth = 6
    • eta = 0.1
    • subsample = .9
    • colsample = .9
  • No feature engineering

  • AMS: 3.5

Brute-Force Search on Threshold

Feature Importance - XGBoost

Issue

Too many hyper-parameters to be tuned

e.g. Cutoff threshold

  • Predicted probability > threshold ==> 'signal'

  • Predicted probability <= threshold ==> 'background'
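A brute-force search over the cutoff can be sketched as below: every candidate threshold is scored by AMS and the best one is kept. The AMS formula assumes the usual challenge definition with a regularization term of 10; the function names, the grid, and the toy predictions, labels and weights are invented for illustration.

```python
import math


def ams(s, b, b_reg=10.0):
    """Approximate median significance (regularization term assumed to be 10)."""
    return math.sqrt(
        2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s)
    )


def best_threshold(probs, labels, weights, thresholds):
    """Try each threshold and return (threshold, AMS) with the highest AMS.

    An event is called 'signal' when its predicted probability > threshold.
    labels: 1 for true signal, 0 for true background.
    """
    best_t, best_score = None, -1.0
    for t in thresholds:
        s = sum(w for p, y, w in zip(probs, labels, weights) if p > t and y == 1)
        b = sum(w for p, y, w in zip(probs, labels, weights) if p > t and y == 0)
        score = ams(s, b)
        if score > best_score:  # ties keep the lowest threshold
            best_t, best_score = t, score
    return best_t, best_score


# Toy predictions, invented for illustration only.
probs = [0.95, 0.80, 0.60, 0.40, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 0]
weights = [1.0, 1.0, 2.0, 1.0, 2.0, 2.0]
grid = [i / 100 for i in range(1, 100)]

t, score = best_threshold(probs, labels, weights, grid)
print(t, score)
```

Choosing the threshold on the same data used for training tends to be optimistic, so in practice the search would be run on held-out predictions.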

Thank You